Skip to content

Enhancement - Fix power monitoring for ROCm 6.3#794

Open
polarG wants to merge 5 commits intomainfrom
dev/hongtaohang/fix-amd-device-manager
Open

Enhancement - Fix power monitoring for ROCm 6.3#794
polarG wants to merge 5 commits intomainfrom
dev/hongtaohang/fix-amd-device-manager

Conversation

@polarG
Copy link
Copy Markdown
Contributor

@polarG polarG commented Mar 25, 2026

Description
GPU monitor crashes with invalid literal for int() with base 10: 'N/A' — amdsmi_get_power_info() returns 'N/A' for average_socket_power on MI300X (hardware reports 0xFFFF for unsupported fields). The
code called int() on this value without validation. Additionally, power_limit is returned in microwatts but was treated as watts.

Changes
get_device_power(): Fall back to current_socket_power when average_socket_power is 'N/A'. get_device_power_limit(): Convert microwatts to watts when needed.

@polarG polarG requested a review from a team as a code owner March 25, 2026 23:02
Copilot AI review requested due to automatic review settings March 25, 2026 23:02
@polarG polarG self-assigned this Mar 25, 2026
@polarG polarG added enhancement New feature or request ROCm labels Mar 25, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Improves ROCm GPU power monitoring robustness by handling unsupported power fields returned by amdsmi_get_power_info() on newer hardware (e.g., MI300X) and normalizing power_limit units so monitoring doesn’t crash and reports correct watts.

Changes:

  • Update ROCm get_device_power() to fall back from average_socket_power to current_socket_power when the former is unsupported ('N/A').
  • Update ROCm get_device_power_limit() to treat power_limit as microwatts (when applicable) and convert to watts.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread superbench/common/utils/device_manager.py Outdated
Comment thread superbench/common/utils/device_manager.py Outdated
Comment thread superbench/common/utils/device_manager.py Outdated
Comment thread superbench/common/utils/device_manager.py Outdated
Comment thread superbench/common/utils/device_manager.py Outdated
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 25, 2026

Codecov Report

❌ Patch coverage is 0% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.18%. Comparing base (700d650) to head (a8a5ce4).

Files with missing lines Patch % Lines
superbench/common/utils/device_manager.py 0.00% 12 Missing ⚠️

❌ Your patch check has failed because the patch coverage (0.00%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #794      +/-   ##
==========================================
- Coverage   85.69%   83.18%   -2.51%     
==========================================
  Files         103      103              
  Lines        7890     7900      +10     
==========================================
- Hits         6761     6572     -189     
- Misses       1129     1328     +199     
Flag Coverage Δ
cpu-python3.7-unit-test 69.76% <0.00%> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

polarG and others added 2 commits April 27, 2026 14:42
- Use numbers.Real instead of (int, float) so numeric types like numpy
  scalars and ctypes ints are accepted, while still rejecting bool.
- Explicitly check for missing keys in amdsmi power_info and log a
  warning so monitoring failures are diagnosable instead of silently
  returning None.
- Replace the magic threshold 100000 with named constants
  _AMDSMI_MICROWATTS_PER_WATT and _AMDSMI_MICROWATTS_THRESHOLD, with a
  clearer comment explaining the µW vs W detection heuristic.
- Add unit tests covering: average_socket_power supported, fallback to
  current_socket_power on 'N/A', both unsupported -> None, missing keys
  -> None, power_limit µW->W conversion, watts passthrough, non-numeric
  power_limit -> None, and missing power_limit key -> None.
Copilot AI review requested due to automatic review settings April 27, 2026 21:57
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

tests/common/test_device_manager.py:31

  • test_nvidia_helper_utils is only guarded by @decorator.cuda_test (SB_TEST_CUDA), but it will still run (and fail) on hosts without NVIDIA GPUs (e.g., AMD-only or CPU-only), because dm.device_manager will not be a NvidiaDeviceManager and methods like get_device_compute_capability() return None. Gate this test on GPU vendor/availability (or split into CUDA vs ROCm tests using the existing rocm_test decorator).
@decorator.cuda_test
@mock.patch('superbench.common.utils.process.run_command')
def test_nvidia_helper_utils(mock_run_command):
    """Test util functions of nvidia_helper."""
    assert (isinstance(dm.device_manager.get_device_count(), numbers.Number))
    assert (isinstance(dm.device_manager.get_device_compute_capability(), numbers.Number))
    assert (isinstance(dm.device_manager.get_device_utilization(0), numbers.Number))
    assert (isinstance(dm.device_manager.get_device_temperature(0), numbers.Number))
    assert (isinstance(dm.device_manager.get_device_power_limit(0), numbers.Number))

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/common/test_device_manager.py
Comment thread tests/common/test_device_manager.py Outdated
Hongtao Zhang added 2 commits April 27, 2026 22:04
Source:
- Extract _amdsmi_power_to_watts helper that applies the microwatts->watts
  heuristic uniformly. This makes get_device_power and get_device_power_limit
  symmetric: previously the limit was µW-corrected but the live power was
  not, so on amdsmi versions reporting power in µW the monitored gpu_power
  could be off from gpu_power_limit by a factor of 1e6.
- Make AmdDeviceManager.__del__ defensive: swallow exceptions from
  amdsmi_shut_down so interpreter-shutdown / partial-import scenarios
  cannot raise (e.g., NameError when the module global has been torn
  down). This is the proper place for the fix that the test was
  previously working around via sys.modules injection.

Tests:
- Drop the sys.modules-level rocml injection (no longer needed once
  __del__ is safe).
- Add test_amd_get_device_power_microwatts_converted to lock in the
  symmetric unit handling for get_device_power.
- Add test_amd_device_manager_lifecycle exercising __init__/__del__
  through normal construction and verifying __del__ tolerates a
  raising amdsmi_shut_down.
Copilot AI review requested due to automatic review settings May 3, 2026 05:01
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

with mock.patch(f'{_DM_MODULE}.rocml', rocml_mock, create=True):
assert manager.get_device_power(0) == 350
assert manager.get_device_power_limit(0) == 750

Comment on lines +12 to +13
_DM_MODULE = 'superbench.common.utils.device_manager'

@@ -3,6 +3,7 @@

"""Device Managerment Library Utility."""
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request ROCm

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants